Technical Articles

I have worked as a co-writer and editor on various articles. Many data scientists, developers, and software engineers look for a writer who can research and understand their products well enough to write about them clearly and precisely. I have written and edited parts of several papers, and I would like to share two of them with you.

1. Unsupervised Feature Selection for Noisy Data

Abstract

Feature selection techniques are widely applied in a variety of data analysis tasks to reduce dimensionality. According to the type of learning, feature selection algorithms are categorized as supervised or unsupervised. In unsupervised learning scenarios, selecting features is a much harder problem because there are no class labels to guide the search for relevant features. The difficulty of selecting features is amplified when the data is corrupted by different kinds of noise. Almost all traditional unsupervised feature selection methods are not robust against noise in the samples. These approaches have no explicit mechanism for detaching and isolating the noise, so they cannot produce an optimal feature subset. In this article, we propose an unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose the feature subset that contains most of the underlying information, using the same criteria as independent component analysis (ICA). Simultaneously, the noise is separated as an independent component. The isolation of representative noise samples is achieved using oblique factor rotation, whereas noise identification is performed using factor pattern loadings. Extensive experimental results over diverse real-life data sets show the efficiency and advantage of the proposed algorithm.

Full Article
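To make the idea of ranking features by their loadings on independent components more concrete, here is a minimal sketch. It is not the RIFS algorithm from the article: it leaves out the noise-isolation step (oblique rotation) entirely, uses a small hand-written FastICA with a tanh nonlinearity, and scores each feature by its largest absolute loading. The function names `fastica_loadings` and `select_features`, the scoring rule, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fastica_loadings(X, n_components, n_iter=200, tol=1e-6, seed=0):
    """Return an (n_features, n_components) matrix of ICA mixing loadings."""
    rng = np.random.default_rng(seed)
    # Standardize so loadings are comparable across features.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Whiten using the top principal directions of the covariance matrix.
    eigval, eigvec = np.linalg.eigh(np.cov(Xs, rowvar=False))
    top = np.argsort(eigval)[::-1][:n_components]
    E, d = eigvec[:, top], eigval[top]
    Z = Xs @ (E / np.sqrt(d))  # whitened data, shape (n_samples, n_components)
    # Deflationary FastICA: estimate one unmixing direction at a time.
    W = np.zeros((n_components, n_components))
    for i in range(n_components):
        w = rng.normal(size=n_components)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            g = np.tanh(Z @ w)
            w_new = (Z * g[:, None]).mean(axis=0) - (1 - g**2).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)  # stay orthogonal to earlier rows
            w_new /= np.linalg.norm(w_new)
            done = abs(abs(w_new @ w) - 1) < tol
            w = w_new
            if done:
                break
        W[i] = w
    # Mixing matrix: how strongly each feature loads on each component.
    return (E * np.sqrt(d)) @ W.T

def select_features(X, n_components, n_select):
    """Hypothetical scoring rule: rank features by largest absolute loading."""
    loadings = fastica_loadings(X, n_components)
    scores = np.abs(loadings).max(axis=1)
    return np.argsort(scores)[::-1][:n_select]

# Synthetic example (assumed data): four features mix two non-Gaussian
# sources, three further features are pure Gaussian noise.
rng = np.random.default_rng(42)
s = rng.laplace(size=(2000, 2))
informative = np.column_stack(
    [s[:, 0], s[:, 1], s[:, 0] + s[:, 1], s[:, 0] - s[:, 1]]
)
noise = rng.normal(size=(2000, 3))
X = np.hstack([informative, noise])

selected = select_features(X, n_components=2, n_select=4)
print(sorted(selected.tolist()))
```

In this toy setup the noise sits in separate features, which is simpler than the sample-level noise the abstract describes; the sketch only illustrates the loading-based ranking.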

2. Organization Component Analysis: The method for extracting insights from the shape of cluster

Abstract

Clustering analysis is widely used to group data into the same cluster when they are similar according to specific metrics. The process of understanding and interpreting clusters is mostly intuitive. However, we observe that each cluster has a unique shape that arises from the metrics on the data, and this shape can represent the organization of the categorized data mathematically. In this paper, we apply a novel topology-based method to study potentially complex high-dimensional categorized data by quantifying their shapes and extracting fine-grained insights about them to interpret the clustering result. We introduce our Organization Component Analysis method for the automatic study of arbitrary cluster shapes without assumptions about the data distribution. Our method explores a topology-preserving map of a data cluster manifold to extract the main organization structure of a cluster by leveraging the self-organizing map technique. To do this, we represent the self-organizing map as a graph. We introduce organization components to geometrically describe the shape of a cluster and its endogenous phenomena. Specifically, we propose an innovative way to measure the alignment between two sequences of momentum changes on a geodesic path over the embedded graph, which quantifies the extent to which a feature is related to a given component. As a result, we can describe the variability among stratified data and correlated features in terms of a smaller number of organization components. We illustrate the use of our method by applying it to two quite different types of data, in each case mathematically detecting organization structures of the categorized data that are much deeper and finer than those produced by standard methods.
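The step of representing a map as a graph and following a geodesic path over it can be sketched as follows. This is only an illustration under stated assumptions, not the method from the paper: the map is assumed to be a rectangular grid whose units are already trained, edges connect grid neighbours and are weighted by the Euclidean distance between unit weight vectors, and the geodesic is taken to be the Dijkstra shortest path. The names `som_to_graph` and `geodesic_path` and the toy weights are hypothetical, and the momentum-alignment measure is not implemented.

```python
import heapq
import math

def som_to_graph(weights):
    """Build a graph from a grid of map units.

    weights: dict mapping (row, col) -> weight vector (tuple of floats).
    Returns an adjacency dict: node -> list of (neighbour, edge_length).
    """
    graph = {node: [] for node in weights}
    for (r, c), w in weights.items():
        # Link each unit to its lower and right grid neighbours.
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in weights:
                length = math.dist(w, weights[nb])
                graph[(r, c)].append((nb, length))
                graph[nb].append(((r, c), length))
    return graph

def geodesic_path(graph, start, goal):
    """Dijkstra shortest path; returns (length, list of nodes)."""
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > best.get(node, math.inf):
            continue  # stale heap entry
        for nb, length in graph[node]:
            nd = d + length
            if nd < best.get(nb, math.inf):
                best[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    if goal not in best:
        return math.inf, []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return best[goal], path[::-1]

# Toy 2x2 map with made-up weight vectors.
weights = {
    (0, 0): (0.0, 0.0),
    (0, 1): (1.0, 0.0),
    (1, 0): (0.0, 5.0),
    (1, 1): (1.0, 1.0),
}
graph = som_to_graph(weights)
length, path = geodesic_path(graph, (0, 0), (1, 1))
print(length, path)  # 2.0 [(0, 0), (0, 1), (1, 1)]
```

Weighting edges by distance in the data space rather than on the grid means the path follows the shape of the data, not the layout of the map.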

Note: The full article will be uploaded here in two months.

Full Article